Jung et al. BMC Proceedings 201 1, 5(Suppl 9):S103 
http://www.biomedcentral.eom/1753-6561/5/S9/S103 



Proceedings 



PROCEEDINGS Open Access 



Identification of multiple rare variants associated 
with a disease 

Jeesun Jung 1,2 * Jessica Dantzer 2 , Yunlong Liu 1,2,3 

From Genetic Analysis Workshop 17 
Boston, MA, USA. 13-16 October 2010 



Abstract 

Identifying rare variants that are responsible for complex disease has been promoted by advances in sequencing 
technologies. However, statistical methods that can handle the vast amount of data generated and that can 
interpret the complicated relationship between disease and these variants have lagged. We apply a zero-inflated 
Poisson regression model to take into account the excess of zeros caused by the extremely low frequency of the 
24,487 exonic variants in the Genetic Analysis Workshop 17 data. We grouped the 697 subjects in the data set as 
Europeans, Asians, and Africans based on principal components analysis and found the total number of rare 
variants per gene for each individual. We then analyzed these collapsed variants based on the assumption that rare 
variants are enriched in a group of people affected by a disease compared to a group of unaffected people. We 
also tested the hypothesis with quantitative traits Q1, Q2, and Q4. Analyses performed on the combined 697 
individuals and on each ethnic group yielded different results. For the combined population analysis, we found 
that UGT1A1, which was not part of the simulation model, was associated with disease liability and that Fill, 
which was a causal locus in the simulation model, was associated with Q1. Of the causal loci in the simulation 
models, Fill and KDR were associated with Q1 and VNN1 was correlated with Q2. No significant genes were 
associated with Q4. These results show the feasibility and capability of our new statistical model to detect multiple 
rare variants influencing disease risk. 



Background 

The identification of common variants associated with a 
disease has been successful through the use of genome- 
wide association studies (GWAS). However, most of the 
associated single nucleotide polymorphisms (SNPs) have 
small effect sizes and small proportions of heritability 
[1]. In addition, some GWAS have failed to detect dis- 
ease causal variants because of the strong assumption 
that common variants contribute to an increase in risk 
of common diseases (the common disease/common var- 
iant hypothesis) [2]. Recently several rare variants have 
been identified that confer a substantial risk for autism, 
mental retardation, and schizophrenia [1]. These obser- 
vations support a hypothesis that rare variants could be 
the primary drivers of common diseases (the common 
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disease/rare variant hypothesis). This hypothesis 
assumes that a significant proportion of the inherited 
susceptibility to relatively common human disease may 
be caused by the accumulation of the effects of a series 
of low-frequency variants acting dominantly or addi- 
tively to increase the relative risk for disease [2]. 

GWAS have been designed to achieve statistical power 
for variants occurring in more than 5% of the general 
population, and they provide little information about 
relatively common variants with frequencies between 1% 
and 5%. However, recent advances in next-generation 
sequencing technologies and ventures, such as the 1000 
Genomes Project, allow for the introduction of novel 
rare variants that most likely occur in less than 5% (or 
even in less than 1%) of one or more major human 
populations. Although knowledge of these novel rare 
variants can be used in association studies of common 
diseases, statistical analyses are challenging because the 
ordinary SNP-by-SNP methods that are suited for 
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GWAS have limited capacity to detect rare variant asso- 
ciation because of the extremely low frequency of each 
variant [3]. Furthermore, statistical power is dramatically 
reduced when we take into account correction for mul- 
tiple tests. Therefore one of the key challenges in rare 
variant association studies is how to capture (i.e., group) 
the variants by genomic region to overcome the reduc- 
tion in power experienced in ordinary SNP-by-SNP 
methods. 

In this paper, we collapse rare variants within a gene 
in two ways: first, using rare variants of all SNPs, and, 
second, using only rare variants of nonsynonymous 
SNPs to see the functional effect on disease traits. We 
then test for association of the rare variants with disease 
traits under the hypothesis that the number of rare var- 
iants within a gene is correlated either positively or 
negatively with the traits. To perform this test, we apply 
a novel statistical approach, called zero-inflated Poisson 
regression models, which provides flexibility for the 
excess of zeros caused by the extremely low frequency 
of the variants [4]. We test 3,205 genes under two sce- 
narios: one including a single group made up of all 697 
subjects after adjusting for population substructure and 
the other involving separating the subjects into three 
ethnic groups based on principal components analysis 
and geographic information. Results from these analyses 
show the feasibility of using this new statistical model to 
take into account the excess of zeros and to detect mul- 
tiple rare variants responsible for disease risk. 

Methods 

Data 

The genotypes for 24,487 exonic SNPs from 3,205 genes 
included in the 1000 Genomes Project were distributed 
by Genetic Analysis Workshop 17 (GAW17) [5]. The 
SNPs consist of synonymous SNPs and nonsynonymous 
SNPs. Data for 697 individuals was provided for analysis 
and consisted of seven population groups: European 
Americans, Tuscans (from Italy), Yoruba, Luhya (from 
Kenya), Han and Denver Chinese, and Japanese. For 
phenotypes, 200 replicate simulations were carried out 
to give disease status (1 = affected, 0 = not affected) and 
three quantitative traits (Ql, Q2, Q4) for each indivi- 
dual. Information such as smoking status, sex, and age 
of each individual was also available. 

Collapsing rare variants 

Analysis of single SNPs with extremely low frequency 
leads to a reduction in statistical power, especially when 
correcting for multiple comparisons. To overcome this 
obstacle, we consider collapsing multiple rare variants 
into a single gene-based variant. The advantages of this 
strategy are that it reduces the number of tests and that 
it enhances the variability of the variants as mutational 



signals [6] . Assume that N is the total number of indivi- 
duals, m is the total number of genes, and nj is the total 
number of SNPs on the ;th gene. We define an indicator 
variable for the genotype of the /<th SNP of the yth gene 
for the ith individual as follows: 



ijk 



for aa, 
for aA, 
for AA, 



(1) 



where a is the minor allele of the /cth SNP, i - 1, 
N, j - 1, m, and k = 1, nj. This indicator variable 
describes the presence of a rare variant equivalently to: 



1 if rare variants (e.g., aa or aA) are present, 

0 otherwise. 



(2) 



We then sum over all SNPs (I< = 1, 2, nj) on 
the yth gene: 



V: 



ijk- 



(3) 



Statistical models 

For each gene, we applied a zero-inflated Poisson model 
based on the results of collapsing the variants. Because 
our method is a gene-based approach, subscript j is 
omitted for convenience. The statistical models are: 



where: 



0 



with probability p { , 



Poissonf/ij) with probability (1 - pj), 



(4) 



> = 1 



V,,; 



ijk 



(5) 



is the total number of rare variants at the yth gene for 
the ith subject. Parameters p = (p lf p 2 , Pn) T an d ^ = 
(^i, (A 2 > I* n) T are modeled by means of canonical 
link generalized linear models as: 

10 

logfo) = p 0 + + fi 2 Z u + p 3 z 2i + p 4 z 3i + ]T PCS ip , (6) 

p=l 



log 



1 -Pi 



= a 0 +a l X i , 



(7) 



where X t is an indicator variable given by X t - 1 for a 
case individual and X t - 0 for a control individual in 
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each replicate data set and Z u , Z 2 i, and Z 3i are age, sex, 
and smoking status, respectively. PCS ip is the principal 
component score obtained by multidimensional scaling 
(MDS) analysis of identical-by-state pairwise distances 
in PLINK. For a quantitative trait analysis, the statistical 
model uses the same framework but replaces X t with 
the quantitative trait (Ql, Q2, Q4) value for each sub- 
ject We test Pi = 0 for the association of each trait. All 
statistical analyses were performed using SAS software. 

Analysis procedure 

Association studies are sensitive to population stratifica- 
tion. The results often vary as a result of differences in 
minor allele frequency, linkage disequilibrium patterns, 
causal pathways, or environmental exposures. To mini- 
mize the effects of such variability, we must examine 
the relationship between genetic variants and disease 
traits within ethnicities. For the GAW17 data set we 
analyzed, we performed an MDS analysis based on all 
the available exonic SNPs using PLINK to estimate 
population substructure and to detect outliers. The 
results of the MDS analysis classified individuals as 
belonging to one of three major ethnic groups: Eur- 
opean Americans and Tuscans in Italy were grouped 
together as Europeans, Yoruba in Nigeria and Luhya in 
Kenya were grouped as Africans, and Japanese and Chi- 
nese in both Beijing and Denver were grouped as 
Asians. These populations were both geographically and 
genetically similar. In addition, we performed principal 
components analysis on all exonic SNPs using Eigenstrat 
[7] and obtained the first 10 main eigenvectors, which 
can track membership in populations with 99.9% accu- 
racy [7]. We control the 10 eigenvectors for subtle sub- 
population stratification caused by the seven original 
populations in the statistical model. 

We then performed genotype-phenotype association 
analysis under two scenarios: (1) analyzing the three 
ethnic groups separately to account for different minor 
allele frequencies among the SNPs and (2) analyzing all 
697 individuals together in one large group to compare 
the results. We adjusted for the first 10 eigenvectors in 



both cases to control for subtle subpopulations, as noted 
previously. For each subject in a given group, we first 
indicated the presence of the minor allele for each SNP 
and then collapsed multiple rare variants within each 
gene for both the all-exonic-SNPs model and the nonsy- 
nonymous-SNPs-only model, as previously described. 
We applied the zero-inflated Poisson model to identify 
significant genes associated with the disease trait and 
quantitative traits, ^-values for 3,205 autosomal genes 
were obtained from each analysis. The estimated Rvalue 
threshold was 1.56 x 10 -5 (= 0.05/3,205) after control- 
ling for multiple tests. Next, we analyzed the 200 repli- 
cates provided in the data set to calculate the power at 
a significance level of 0.05 and 1.56 x 10~ 5 in order to 
compare our results with the true genes in the GAW17 
answers. 

Results 

MDS analysis revealed three ethnic groups formed by 
the seven populations in the data set. Five outliers were 
found among the 697 subjects and were subsequently 
removed, leaving the total number of individuals used in 
the analysis at 692. Broken down by ethnic group, we 
found 215 Africans, 321 Asians, and 156 European indi- 
viduals (data not shown) [7]. To determine whether 
genetic structure differed across the ethnic groups, we 
estimated the minor allele frequencies of the SNPs 
found on two randomly selected genes. Table 1 shows 
how the minor allele frequencies and rare variants 
found in each gene vary greatly across the three ethnic 
groups. To account for these differences, we collapsed 
the rare variants within a gene and performed the analy- 
sis for each ethnic group separately. 

Table 2 shows genes that were significantly associated 
with disease liability and quantitative traits Ql, Q2, and 
Q4 based on the pooled data set of 692 individuals. For 
each of these genes, the table displays the percentage of 
replicates, out of 200, whose ^-values were less than 
0.05 before adjusting for multiple tests or less than 1.56 
x 10 -5 after adjusting for multiple tests. Genes that 
were significant at the 5% significance level in at least 



Table 1 Minor allele frequencies of SNPs within two randomly selected genes 


Gene 


Chromosome 


SNP 


Function 


Minor allele 




Minor allele frequency 














All* 


Europeans 


Asians 


Africans 


CFH 


1 


C1S8678 


Nonsynonymous 


A 


0.341 


0.032 


0.579 


0.982 






C1S8682 


Synonymous 


C 


0.001 


0 


0 


0.005 






C1S8684 


Synonymous 


A 


0.311 


0.923 


0.106 


0.964 






C1S8686 


Nonsynonymous 


C 


0.325 


0.942 


0.134 


0.959 


SMAD7 


18 


C18S2642 


Synonymous 


C 


0.006 


0 


0.022 


0 






C18S2643 


Synonymous 


A 


0.009 


0 


0.034 


0 






C18S2648 


Synonymous 


T 


0.006 


0 


0.028 


0 



* All 697 individuals were used to calculate the minor allele frequency. 
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Table 2 Genes associated with disease liability, Ql, and Q2 for the pooled populations at the significance levels of 
0.05 and 1.56 x 10 -5 



Trait 


All SNPs (nonsynonymous and synonymous) 




Nonsynonymous SNPs 




Gene 


Replicates significant 

at n n^ (o/ n \ a 

aX U.Uj ^ /0) 


Replicates significant at 

i c/; v in -5 (o/ n \ b 

1 .DO A IU ^ /O) 


Gene 


Replicates significant 

oX U.Uj ^ /0) 


Replicates significant at 

i c/; v in -5 (o/ n \ b 

1 . DO A IU V '0/ 


Plicoaco 1 i a h i 1 i t\ / 
Lylocdoc lldUllliy 

(case-control) 


r~vci TD~> 
L YjL 1 Hz 


77.5 


68.5 


r\l\n 1 U 1 


72.5 


36.5 




MGEA5 


84 


79.5 


AQP3 


74 


66.5 




NF2 


85.5 


76.5 


CAPNS1 


74 


64 




PTPN11 


83.5 


71 


EIF6 


84 


77.5 




UGT1A1 


72.5 


55 


FLT1 
LYPD6 
PCDHGA4 
PRKCG 
UGT1A1 


72.5 
77.5 
77 
70 
71.5 


5 
72.5 
69.5 

65 
55.5 


Q1 


CYSLTR2 


62.5 


45 


FLT1 


96.5 


97.5 




FLT1 


100 


92 


SLC2A13 


69.5 


0 




PTPN11 


64.5 


42.5 








Q2 


PIK3R1 


66.5 


29 


VNN1 


62 


0 



PTGI5 62.5 41.5 

PTPN11 71.5 51.5 

Listed genes were significant in at least 70% of replicates. Genes in boldface are the true causal genes in the simulation model. 

a Percentage of replicates in which the gene was significantly associated with disease liability, Q1, or Q2 at the significance level of 0.05. 

b Percentage of replicates in which the gene was significantly associated with disease liability, Q1, or Q2 at the significance level of 1.56 x 10~ 5 



70% of replicates are displayed in Table 2. Genes in 
boldface are causal genes in the simulation model, and 
those not in boldface are examples of false-positive asso- 
ciation that were seen frequently across replicates. 

Through the disease liability analysis, we identified 
four unique genes with a replication rate of more than 
70% when using all exonic SNPs and eight unique genes 
when using only nonsynonymous SNPs. UGT1A1, 
which was not part of the simulation model, was 
detected in both cases. Analysis of quantitative traits Ql 
and Q2 identified five unique genes when including all 
exonic SNPs and two unique genes when including only 
nonsynonymous SNPs. FLT1, one of the true causal 
genes in the simulation model, was associated with Ql 
in both cases with a replication rate of more than 90% 
after adjusting for multiple tests. PTPN11, which was 
not a true causal gene in the model, was identified in 
association with disease liability, Ql, and Q2 when ana- 
lyzing all exonic SNPs but not when analyzing only non- 
synonymous SNPs. Also, based on a replication rate 
criterion of 70%, we found no significant genes asso- 
ciated with Q4 in any of the scenarios. 

Table 3 lists the genes that are significantly associated 
with disease liability for each ethnic group when using 
all exonic SNPs. As illustrated, many of the genes that 
were significantly associated in a high percentage (>70%) 
of replicates in one particular ethnic group were not 
detected in the combined sample. All these associations 
were false-positive results. Tables 4 and 5 provide the 



power to detect significant association of genes that 
were in the GAW17 simulation models. Three genes 
(FLT1 for Ql and disease liability, KDR for Ql, and 
VNN1 for Q2) were detected with moderate to high 
power in either the all-exonic-SNPs or the nonsynon- 
ymous-SNPs-only analysis. PTK2 was associated with 
disease liability with a power of 50% in the nonsynon- 
ymous-SNPs-only analysis. 

Discussion and conclusions 

It is well known that not all coding rare variants within 
coding portions of genes are causative, and grouping 
them together regardless of their functional effect may 
produce false-positive errors. Therefore successful use of 
collapsing strategies increases the power to detect 
potential causative variants in association studies. We 
have applied a zero-inflated Poisson regression model to 
two separate collapsing methods: one in which we used 
all available exonic SNPs and the other in which we lim- 
ited the data to nonsynonymous SNPs alone. Based on 
these analyses, we found that four unique genes were 
associated with disease liability and five were in associa- 
tion with quantitative traits Ql and Q2 when collapsing 
all exonic SNPs within a gene. When collapsing only the 
nonsynonymous SNPs within a gene, we found eight 
unique genes in association with disease liability and 
two unique genes from Ql and Q2. UGT1A1 and FLT1 
were the only genes detected by both analyses. The dif- 
ferences between these results show how the detection 
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Table 3 Genes significantly associated with disease liability for each ethnic group at the significance levels of 0.05 and 
1.56 x 10" 5 





Africans (n 


= 215) 




Europeans (n 


= 156) 




Asians (n = 


321) 


Gene 


Replicates 
significant at 
0.05 (%) a 


Replicates 
significant at 1.56 

x 10" 5 (%) b 


Gene 


Replicates 
significant at 
0.05 (%) a 


Replicates 
significant at 1.56 

x 10" 5 (%) b 


Gene 


Replicates 
significant at 
0.05 (%) a 


Replicates 
significant at 1.56 

x 10" 5 (%) b 


CYSLTR2 


84 


71 


CCNH 


72 


57 


CHEK1 


95.5 


95.5 


FER 


77 


50.5 


CLPTM1 


71 


38 


FU43860 


72.5 


64 


FNDC3A 


73 


68 


CYP17A1 


74 


65.5 


IL4I1 


74.5 


53.5 


NTRK1 


71 


59.5 


GDNF 


83 


22 


LRRC20 


74.5 


69.5 


OR8J3 


71.5 


51.5 


0R13A1 


86.5 


82.5 


NF2 


87.5 


67.5 


PTPN11 


86.5 


74 








SPINK2 
ST5 


89.5 
84.5 


47 
79.5 



Listed genes were significant in at least 70% of replicates. None of the genes were true causal genes in the simulation model. Analysis is for all SNPs. 

a Percentage of replicates in which the gene was significantly associated with disease liability at the significance level of 0.05. 

b Percentage of replicates in which the gene was significantly associated with disease liability at the significance level of 1.56 x 10~ 5 . 



of rare variant association depends on the collapsing 
strategy used. 

In addition to differences in the collapsing strategy, 
our approach to analyzing each ethnic group versus the 
combined sample group shows how variations in popu- 
lation can affect the results. Six genes on average were 
detected from each ethnic group. Few of the same 
genes, whether they were false-positive associations or 
true associations, were detected by the analysis of the 
combined sample group versus the individual ethnic 
groups. From the results of our analyses, we can also 
see that the three ethnic groups have a wide range of 
allele frequencies for some SNPs. Variants within the 
same gene that have different allele frequencies should 
not be collapsed into the same group to avoid a loss of 
statistical power. The GAW17 simulation model did not 
take population allele frequency differences into 
account, and this could be the cause of much of the dis- 
crepancy in the detection of genes using various collap- 
sing approaches or analytic strategies. Therefore 
ethnicity-specific analysis might be a better approach to 
performing association studies of genome sequence data 
than using the full combined sample group because 



causal rare variants may have different frequencies in 
different ethnic groups in real human populations. 

In using our novel approach to identify rare variants 
associated with disease traits, we found that 5 causal 
genes (out of 35) in the simulation models were 
detected with a reasonable power. However, we did not 
detect many of the genes in the GAW17 answers as 
being significant, for several reasons. First, our statistical 
model does not take into account any pathway informa- 
tion. In the simulation model the genes associated with 
Ql are from the vascular endothelial growth factor 
(VEGF) pathway, and the genes associated with Q2 are 
from the cardiovascular disease risk and inflammation 
pathways, both of which influence disease liability. The 
power to detect true causal variants might have been 
improved if all rare variants within a pathway were col- 
lapsed together into a single pathway variant indicator. 
Second, the effect size of many of the "causal" variants 
was quite small, thereby limiting power. Third, the 
assumptions made by a zero-inflated Poisson model for 
the collapsed rare variants might not be suitable for 
some genes with only a few SNPs. In that case, one 
might need to apply a zero-inflated binomial regression 



Table 4 Power to detect causal genes (out of 35 genes) in the GAW17 simulation models using the combined sample 

Trait All SNPs (nonsynonymous and synonymous) Nonsynonymous SNPs 



Gene Power calculated at Power calculated at Gene Power calculated at Power calculated at 

significance level 0.05 (%) significance level 1.56 x 10~ 5 significance level 0.05 (%) significance level 

(%) ~ 1.56x10" 5 (%) 



Disease 


FLT1 


100 


92 


PTK2 


54.5 


33.5 


liability 






















FLT1 


96.5 


97.5 


Q1 


FLT1 


100 


92 


FLT1 


96.5 


97.5 




KDR 


56.5 


0 


KDR 


54.5 


0 


Q2 


VNN1 


46.5 


0 


VNN1 


62 


0 
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Table 5 Power to detect causal genes in the GAW17 simulation model as associated with disease liability using each 
ethnic group 





Africans (n 


= 215) 




Europeans (n 


= 156) 




Asians (n 


= 321) 


Gene 


Power 
calculated at 
significance 
level 0.05 (%) 


Power calculated 
at significance 
level 1.56 x 10" 5 
(%) 


Gene 


Power 
calculated at 
significance 
level 0.05 (%) 


Power calculated 
at significance 
level 1.56 x 10" 5 
(%) 


Gene 


Power 
calculated at 
significance 
level 0.05 (%) 


Power calculated 
at significance 
level 1.56 x 10" 5 
(%) 


PIK3C3 


58 


32 


PIK3C3 
SHC1 


66.5 
42 


51.5 
29.5 


NRAS 


43 


33 



Analysis is for all SNPs. 



to detect the association, or one may need to take into 
account the dispersion parameter of variance in the 
model for consistent parameter estimation. Therefore 
further research is needed to evaluate the zero-inflated 
Poisson regression model for its ability to handle the 
various characteristics of various rare variants. 
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